Generalized Hyper-Systolic Algorithm 



A. Galli 

Max-Planck-Institut fiir Physik, Fohringer Ring 6, D-80805 Miinchen Q 



Abstract 

We generalize the hyper-systolic algorithm proposed in for abstract data structures 
on massive parallel computers with Up processors. For a problem of size V the 
communication complexity of the hyper-systolic algorithm is proportional to y/n^V, 
to be compared with UpV for the systolic case. The implementation technique is 
explained in detail and the example of the parallel matrix-matrix multiplication is 
tested on the Cray-T3D. 
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1 Introduction 



In a near future Teraflops supercomputers will be available from many vendors. A 
possible architecture to achieve such a high performance is based on massive paral- 
lelism. Many powerful processors (PE) are linked together. Each of them can access 
his own local memory which is usually relatively large. The communications between 
the PE's is performed by a fast network. Using this architecture the hope is to obtain 
a linear speedup in the number of PE's. This is, of course, not possible because the 
overhead of the communications increases with the number of PE's. 

The overhead due to the communications can be encountered in many algebraic 
problems. For example, let us assume we have two matrices A and B of size V 
which are distributed among the PE's, then the simple operation A ■ B needs inter 
processor communications. The volume of the data which have to be communicated 
is then proportional to Up keeping the size of the matrices fixed. In fact the volume 
of private data owned by one PE is proportional to V/np. Using an usual systolic 
algorithm the matrix-matrix multiplication needs that each PE communicates his 
local portions of the matrices to each PE so that the volume of the communicated 
data will be n^V/np which is proportional to Up. 

The granularity of a problem is considered to be the average size of tasks to be 
performed in parallel. A problem of size V implemented in parallel on Hp processors 
has a granularity g proportional to V/up. If the granularity of the problem is very 
large the inter processor communication becomes irrelevant compared with the time 
spent for the local computations. But if the granularity of the problem is small the 
overhead due to the communication is a serious obstacle to the wanted speedup. In 
fact, using an usual systolic algorithm, the data to be communicated is proportional 
to Vup and because we can assume that each PE can communicate data in parallel, 
the overhead time is constant in the number of PE's Vupjnp = V. The computer 
time needed to solve a problem of granularity g is then proportional tor = a~ + b-V 
where a and b are constants. Asymptotically in Hp this time reach a plateau given 
by the overhead time b ■ V. 

Of course, performance can be obtained by increasing the size of the problem so 
that the granularity remains constant. This is nice, but usually it is not what one 
needs in real applications. In fact, one may also needs speedup by more or less main- 
taining constant the size of the problem. It is then an important task to find new 
algorithms that need less inter processors communications even at low granularity. 
This is possible exploiting the fact that usually at low granularity there is abundance 
of memory. 

Recently, it was proposed an hyper- systolic ^ algorithm for organizing the data 
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communications related to the problem 



yi= Yl fi^h^j) (1) 

j over all PE 

where Xi and Ui are some data stored on the i PE and / is some function. It was 
shown that the volume of the communicated data was proportional to ^Jn^ for fixed 
size of the problem^. This algorithm can be useful in various fields of applications, 
like n— body dynamics, polymer chains, protein folding or signal processing. 

Equation (|1]) describes a small class of problems. Of course, the hyper-systolic 
algorithm can be generalized for a more large class of problems with global com- 
munication and with a more complicate data structure. We consider the following 
generalized problem 

Ci = ^ f(A, ,Bi) 
ip over all PE 

where now A, B and C are abstract data structures, / is some operations on A and B 
which has an output with the same data structure type of C and is an associative 
and commutative operation on the data structure of type C. The portions of the 
data owned by the different PE's are denoted by the subscript i^. The index imype 
characterizes the local PE. 

For example, if one considers the matrix-matrix product the same formalism can 
be applied. In this case Ai^^^^ and Ci^^^^ are the portions of the matrices A and C 
stored locally on the PE imype and Bi^ is the portion of the matrix B stored on the 
remote PE ip. The product f{Ai^y^^, Bip) = Ai^^^^^Bip is a generalized matrix prod- 
uct which takes care how the matrices are distributed among the PE's (column-like, 
row-like or block-like) and the operator is a direct sum on the respective Up linear 
subspaces of Q^^^^. 

The hyper-systolic algorithm replicas a volume of data proportional to ^Jr^y jup 
coming from the portions of A and B stored on Oi^^fTp) different PE's, and uses all 
possible combinations of them to built O(y^) different portions of C which are then 
sent back to the original PE's. The communication directions are chosen so that all 
PE's receive all needed data. In this case the amount of data to be communicated is 
only proportional to V ^^n^ for a fixed size V of the problem. Then the time needed 
to solve a problem of granularity g r = a ■ ^ ■ where a and h are some 
constants. Now the overhead time h ■ ^= vanishes asymptotically in the number of 
PE's. 



^In the original work of it is showed that the communicated data is proportional to it^J"^ if 
one increase the size of the problem linearly in rip. 
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In this paper we discuss in details the hyper-systohc algorithm for an abstract 
data structure and then we consider the useful example of the matrix-matrix product 
which is one of the most hard problem in linear algebra operations on massive parallel 
computers and also mimics all aspects of more general applications. We also report 
the gain in performance against the usual systohc algorithm tested on the Cray-T3D. 



2 Generalized Algorithm 

We consider a massive parallel computer built up by a set of Up PE's labeled by 
SpE = {ip G {0,...,'^^p — 1}} and a network connecting the PE's = {bij\i,j e 
Spe}- The index imype denotes the local PE. We define a set of three data structures 
D — {A, B, C}. Each PE owns a portion of D. We denote this portion by Di^. For 
three variables A,B & B and C E C we want to compute 

ip over all PE 

where / : {A^^ype, Bi^^p^} Ci^^^^ is some operations on the local A and B which 
has an output of type C and © : {Ci^^^^,^, Cj^^^J Qmype is an operation with 
associative and commutative properties on the data structure of type C with output 
of type C. 

The usual systohc algorithm works as follow 
Algorithm 1 Systolic Algorithm 

do Vip E SpE 

get Bip from PE ip 

od 



where the communication routine get represents a shift sequence optimized for the 
used architecture. Wc do not go into the details of how this routine organizes the 
shifts because it is standard. 

The idea of the hyper-systolic algorithm is to allocate on each PE a set of replicas 
e Dj \k G /). The set of integer I — |0, ■.■,K} contains the index of the 

I Imype tmype I J o I. 5 ' J 

replicas of Di^^^^ that we want to memorize on each PE. Each replica A^ and B^ has 
to be filled with a copy of some remote portion of A and B. Then many combinations 
of these portions can be used to find many contributions to /(Ai^^^^, B^p) . These have 
then to be sent back to the right PE's. To control all the directions of communications 
one has to ensure that all needed contributions occur. This can be done by defining 
a mapping (j) from S'p^; x / to S'pb by the following algorithm 
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Algorithm 2 



do Vip & SpE 

^tmp 

doVA; G / 

4>{ip, k) = itmp ~ 
while 0(ip, fc) < do 

k) = (j){ip, k) + Hp 

od 

^tmp 't'ij'pi k) 

od 

od 

where the = {0, 71, 7i^} with 7^ positive and integer. The dimension of the 
vector r has to be chosen with K minimal so that 

Vm G SpE ■.m = 'ji + 7i+i + ... + 7^+^ 01 m = Up - (7^ + 7^+1 + ... + 7^+^) (3) 

with < i + j < K. How to determine the appropriate vector F is discussed in [|l|. 
The best choice of the vector F is called the best basis. The best basis has dimension 
K ~ y/n^- There is no analytical recipes for finding the best basis for a given number 
of PE's. One has to try all combinations of numbers and select the ones which satisfy 
(^. But this can be a very hard computational problem for a large number of PE's. 
For the allowed configurations of the Cray-T3D we have enumerate the best basis in 
Table 1. Fortunately, there exists a regular basis which is very easy to be found and 
has dimension K ~ ^y2 ■ Up. This basis is given by 

T = {1,1,...,1,K,K,...,K) (4) 

^ . ' ^ V ' 

K K-l 

where K is the integer part of y/2np. In any case the choices of K are of order O(y^). 

The mapping is local to each PE and it is equal for each PE. It contains all 
informations about the location of the data D which has to be communicated to fill 
the K replicas. The definition (|^) ensures that all combinations of PE's occurs at 
least once. To ensure that only one PE will compute the combination ffAf^ , -Bf ^ ) 
and send it back to the right PE one has to check the multiplicity of the occurrence 
of this combination among the PE's defining a multiplicity table M going from 1^ to 
{0, 1}. All informations about the multiplicity of the operations can be read from the 
mapping 6. M takes the value when a duplicate of the combination f (Af^ , ) 
is encountered. This table controls that in the operation © the elements contributes 
only once by switching off the PE's which duplicates this operation when M(/ci, ^2) = 
0. 
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Algorithm 3 

mype: ^l); 4^{j'mypei 
do ip 0, imype 1 

do Wki, k2, ks, ki& I 
if {(p{imype, h) = (f){ip, k^) and (f>{imype, ^2) = 4>{ip, ks)) then 

M{4>{imype, ^l), 4>{imype, ^2)) = 
fi 

od 

od 

The hyper-systolic algorithm can then be organized as following: one gets from the 
remote PE's selected by the mapping the K portions of data {A, B} needed to 
built all combinations f(Af^ , Bf^ ) among the replicas. Then one sums up 
these combinations in the respective replica Cf^^^^ according to the operation © and 
controlling the multiplicity. Now one has to synchronize^ all PE's to be sure that 
all have done their job. Then one has to send back according to the exact reverse 
0(^mype, k) scqucnce all C^^ ^ and sum up them to the result local C according with 
©. 

Algorithm 4 Hyper- Systolic Algorithm 
do ykel 

get Ai^ from PE ip = (t){imype, k) and store in A\^^^^ 
get Bi^ from PE ip = (f){imype, k) and store in B^^^^^ 

od 

do Vfci ,k2El 

if M(ki, k2) = Ithen C^ = C^' © f{A'^' , 5^ ) 

\ ' ^/ I'mype i'mype " ^ ''mype' ''mype-' 



od 



synchronize 

do Vip G SpE, k E I 

if 0(2p, k) = i^ypethen 
get from PE ip and store in T G A, 



I'mype f^mype ^ 



fi 



od 



This pseudo-code can be used to implement any parallel problem satisfying eq. (|^). 
As a test example we have implemented the product of two matrices on the Cray- 
T3D. We have distributed the matrices so that each PE owns some fixed portion of 

■^Of course, the synchronization is needed only on MIMD architectures. 
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rows or columns of the matrices. We have tested the algorithm for matrices with 
dimensions going from 16 x 16 to 2048 x 2048 real entries on 2 to 256 PE's. We 
have measured the ratio between the time spent to communicate data for the systolic 
against the hyper-systolic algorithm. This ratio expresses the gain in computer time 
for the communications. The hyper-systolic algorithm communicates three times a 
volume of data equal to K ■ V (where K is the integer part of y/n^ for the best 
basis or the integer part of y/'2np for the regular basis) and the systolic algorithm 
communicates a volume of data equal to Up ■ V. In the ideal case we can assume 
that the networks is able to transfer data in parallel with the same speed for both 
algorithms then the ideal ratio is 



Of course, the hyper-systolic algorithm can not have an ideal ratio against the sys- 
tolic algorithm because it involves communications to more remote PE's than in the 
systolic case. This eventually can overload the network and slow down the commu- 
nications. However, if the topology of the network is a (i— torus with d > 2 this 
effect should be negligible because there exist many paths which connect each pair 
of PE's. Only if c? = 1 then a 1— torus is a circle some problem can arise because for 
connecting a pair of PE's there exist only two paths. In fig. 1 we plot the measured 
ratio averaged on the different dimensions of the matrices. We see that the measured 
values lies near the ideal case as expected from the topology of the Cray-T3D. 

3 Conclusion 

We have generalized the hyper-systolic algorithm proposed by |jl|. This algorithm 
is able to handle a vast class of massive parallel problems which need global com- 
munications. It reduces significantly the overhead due to the communications when 
compared with the usual systolic algorithm. As an example we have implemented 
the algebraic problem of multiplying two large matrices in parallel. In our example 
we have observed a speedup of the communications of about using the best basis 



and ^-/l using the regular basis. 
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Up 


K 


1 


2 


1 


1 


4 


2 


1 1 


8 


3 


1 1 2 


16 


4 


12 2 4 


32 


6 


1114 4 8 


64 


8 


1 1 12 3 10 8 20 4 


128 


? 


? 


256 


? 


? 



Table 1: The best basis for the configurations of the Cray-T3D at the EPFL, Switzer- 
land. The basis for 128 and 256 PE's could not be found. 
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Figure 1: Average ratio between the time spent to communicate data for the sys- 
toUc algorithm against the hyper-systohc one for the example of the matrix-matrix 
product. The squares arc obtained using the best basis, the crosses using the reg- 
ular basis. The lines represent the ideal cases for the best basis (solid) and regular 
basis (dashed), respectively. The errorbars of the average over the different matrix 
dimensions are smaller than the plotted symbols. 
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